It’s difficult to overstate the extent to which the COVID-19 pandemic has tested the world’s public health infrastructure over the past two years. At the same time, the COVID-19 “experience” has manifested unequally – not just country to country, but city to city, and even borough to borough. As one of the most heterogeneous urban areas in the world, New York City provides a fascinating case study into the ways socioeconomic status may be associated with, or even mediate, disparities in health outcomes. (For instance, it’s already well-documented that income and race, along with socioeconomic privilege and political ideology, drive inequities in COVID vaccination rate across US cities.)
Knowing that socioeconomic factors have historically been associated with health outcomes, we aim to examine relationships between a range of predictors (e.g. race/ethnicity, education, broadband internet access, household income / occupational income score, public vs. private health insurance) and COVID-19 health outcomes – namely, hospitalizations, deaths, and vaccinations.
Our initial questions are centered on how demographic factors in New York City trend with COVID-19 outcomes. Particularly, what are the key demographic correlates (e.g. race/ethnicity, education) of COVID-19 hospitalizations or vaccinations? Do PUMAs with more individuals on public health insurance fair equally well on key outcomes compared to those with more individuals on private health insurance? How does household income fit into the equation? Broadly, we aim to explore the myriad relationships between predictors and COVID-19 outcomes across New York City census tracts and boroughs.
The Integrated Public Use Microdata Series (IPUMS USA) consists of individual-level data from samples of the US population drawn from the American Community Surveys (ACS) of 2000 - present as well as from fifteen federal censuses from 1850 to 2010. Each record in IPUMS is a person with numerically coded characteristics and “weight” variables indicating how many persons in the population are represented by each record. Samples were created at different times by different investigators, which lead to a variety of documentation conventions. However, IPUMS applies consistent coding and documentation across records to allow for effective analysis over time. A data extraction system exists to allow users to pull particular samples and variables from IPUMS. This project uses demographic and macroeconomic data from the American Community Survey (ACS) 2019 five-year estimate via IPUMS.
While data from IPUMS is recorded at the individual-level, each interview is coded to a particular Public Use Microdata Area (PUMA) geography where the census interviewee’s housing unit was located at the time of interview.
The New York City Department of Health and Mental Hygiene (NYC DOHMH) is one of the oldest public health agencies in the United States. Among other responsibilities, the DOHMH monitors the spread of infectious disease in NYC. The Department of Health classified the beginning of the COVID-19 pandemic as February 29, 2020, the date of the first laboratory-confirmed case. Since then, the DOHMH has recorded and reported COVID-19 data on a daily, weekly, or monthly basis. These data include cases, hospitalizations, and deaths by borough, modified Zip Code tabulation area (ZCTA), and demographic factors. As NYC has administered vaccinations for COVID-19, these data have been recorded and made available by borough, ZCTA, and demography. This project uses COVID-19 hospitalization rates, death rates, and vaccination rates by ZCTA in NYC.
The Baruch Geoportal, maintained by the Newman Library at Baruch College, is a repository of geospatial resources that includes tabular data sets, tutorials, maps, and crosswalks. This project uses a crosswalk data set from Baruch Geoportal to apportion NYC ZCTAs to PUMAs, enabling analysis of PUMA-coded data from IPUMS alongside COVID-19 outcome data from DOHMH at similar levels of granularity.
As mentioned above, monthly outcome data reported at the ZCTA-level geography were obtained from the NYC DOHMH. First, these data were summed over the time interval March 2020 - Sept 2021 to obtain one cumulative incidence measure per ZCTA. However, the predictor variables from IPUMS were coded to the PUMA-level geography, and so we used the Baruch ZCTA-PUMA crosswalk data set to convert our data into common geographical units.
The following columns were used to convert ZCTA-level outcome data to PUMA-level data:
zcta10: ZCTA unique identifier
puma10: PUMA unique identifier
per_in_puma: percentage of the specified ZCTA located within the specified PUMA
per_of_puma: percentage of the specified PUMA occupied by the specified ZCTA
The following shows the mathematical expression used to convert from ZCTA-level outcome data to PUMA-level outcome data:
\[ \sum_{i = 1}^{n} \text{zcta_outcome_data}_{i} \cdot \text{per_in_puma}_{i} \cdot \text{per_of_puma}_{i} = \text{puma_outcome_data} \]
This bridged us towards one cumulative incidence measure per PUMA (per 100,000 people for hospitalizations and deaths, and per 100 people for vaccinations).
We proceeded to clean our demographic dataset from IPUMS, which originally contained over 350,000 interviews from the NYC area:
# Cleaning
jimzip <- function(csv_file, path) {
# create full path to csv file
full_csv <- paste0(path, "/", csv_file)
# append ".zip" to csv file
zip_file <- paste0(full_csv, ".zip")
# unzip file
unzip(zip_file)
# read csv
data_extract <- read_csv(csv_file)
# be sure to remove file once unzipped (it will live in working directory)
on.exit(file.remove(csv_file))
# output data
data_extract
}
# Apply function to filtered census data CSV
census_data <- jimzip("census_filtered.csv", "./data")
# Merging the Outcome Data
# Read in PUMA outcomes data
health_data <-
read_csv("./data/outcome_puma.csv")
# Merge census data with PUMA outcomes data
merged_data <- merge(census_data, health_data, by = "puma")
# Deprecate census data alone
rm(census_data)
## Cleaning the Data
# Clean the merged census and outcomes data
# Each row represents one
cleaned_data =
merged_data %>%
# Remove variables less useful for analysis or redundant (high probability of collinearity with remaining variables)
select(-serial, -cluster, -strata, -multyear, -ancestr1, -ancestr2, -labforce, -occ, -ind, -incwage, -occscore, -pwpuma00, -ftotinc, -hcovpub) %>%
# Remove duplicate rows, if any
distinct() %>%
# Rename variables
rename(
borough = countyfip,
has_broadband = cihispeed,
birthplace = bpl,
education = educd,
employment = empstat,
personal_income = inctot,
work_transport = tranwork,
household_income = hhincome,
on_foodstamps = foodstmp,
family_size = famsize,
num_children = nchild,
US_citizen = citizen,
puma_vacc_rate = puma_vacc_per,
on_welfare = incwelfr,
poverty_threshold = poverty
) %>%
# Recode variables according to data dictionary
mutate(
# Researched mapping for county
borough = recode(
borough,
"5" = "Bronx",
"47" = "Brooklyn",
"61" = "Manhattan",
"81" = "Queens",
"85" = "Staten Island"
),
rent = ifelse(
rent == 9999, 0,
rent
),
household_income = ifelse(
household_income %in% c(9999998,9999999), NA,
household_income
),
on_foodstamps = recode(
on_foodstamps,
"1" = "No",
"2" = "Yes"
),
has_broadband = case_when(
has_broadband == "20" ~ "No",
has_broadband != "20" ~ "Yes"
),
sex = recode(
sex,
"1" = "Male",
"2" = "Female"
),
# Collapse Hispanic observation into race observation
race = case_when(
race == "1" ~ "White",
race == "2" ~ "Black",
race == "3" ~ "American Indian",
race %in% c(4,5,6) ~ "Asian and Pacific Islander",
race == 7 & hispan %in% c(1,2,3,4) ~ "Hispanic",
race == 7 & hispan %in% c(0,9) ~ "Other",
race %in% c(8,9) ~ "2+ races"
),
birthplace = case_when(
birthplace %in% 1:120 ~"US",
birthplace %in% 121:950 ~ "Non-US",
birthplace == 999 ~"Unknown"
),
US_citizen = case_when(
US_citizen %in% c(1,2) ~ "Yes",
US_citizen %in% 3:8 ~"No",
US_citizen %in% c(0,9) ~ "Unknown"
),
# Chose languages based on highest frequency observed
language = case_when(
language == "1" ~ "English",
language == "12" ~ "Spanish",
language == "43" ~ "Chinese",
language == "0" ~ "Unknown",
language == "31" ~ "Hindi",
!language %in% c(1,12,43,0,31) ~ "Other"
),
# Collapse multiple health insurance variables into single variable
health_insurance = case_when(
hcovany == 1 ~ "None",
hcovany == 2 && hcovpriv == 2 ~ "Private",
hcovany == 2 && hcovpriv == 1 ~ "Public"
),
education = case_when(
education %in% 2:61 ~ "Less Than HS Graduate",
education %in% 62:64 ~ "HS Graduate",
education %in% 65:100 ~ "Some College",
education %in% 110:113 ~ "Some College",
education == 101 ~ "Bachelor's Degree",
education %in% 114:116 ~ "Post-Graduate Degree",
education %in% c(0,1,999) ~ "Unknown"
),
employment = case_when(
employment %in% c(0,3) ~ "Not in labor force",
employment == 1 ~ "Employed",
employment == 2 ~ "Unemployed"
),
personal_income = ifelse(
personal_income %in% c(9999998,9999999), NA,
personal_income
),
household_income = ifelse(
household_income %in% c(9999998,9999999), NA,
household_income
),
on_welfare = case_when(
on_welfare > 0 ~ "Yes",
on_welfare == 0 ~ "No"
),
poverty_threshold = case_when(
poverty_threshold >= 100 ~ "Above",
poverty_threshold < 100 ~ "Below"
),
work_transport = case_when(
work_transport %in% c(31:37, 39) ~ "Public Transit",
work_transport %in% c(10:20, 38) ~ "Private Vehicle",
work_transport == 50 ~ "Bicycle",
work_transport == 60 ~ "Walking",
work_transport == 80 ~ "Worked From Home",
work_transport %in% c(0, 70) ~ "Other"
)
) %>%
mutate(education = fct_relevel(education, "Less Than HS Graduate", "HS Graduate", "Some College", "Bachelor's Degree", "Post-Graduate Degree", "Unknown")) %>%
# Eliminate columns no longer needed after transformation
select(-hispan, -hcovany, -hcovpriv) %>%
# Relocate new columns
relocate(health_insurance, .before = personal_income) %>%
relocate(poverty_threshold, .before = work_transport) %>%
relocate(on_welfare, .before = poverty_threshold) %>%
relocate(perwt, .before = hhwt) %>%
# Create factor variables where applicable
mutate(across(.cols = c(puma, borough, on_foodstamps, has_broadband, sex, race, birthplace, US_citizen, language, health_insurance, education, employment, on_welfare, poverty_threshold, work_transport), as.factor)) %>%
# Ensure consistent use of percentages
mutate(
puma_death_rate = puma_death_rate / 1000,
puma_hosp_rate = puma_hosp_rate / 1000
)
# View first few rows
cleaned_data %>%
head() %>%
knitr::kable()
| puma | perwt | hhwt | borough | rent | household_income | on_foodstamps | has_broadband | family_size | num_children | sex | age | race | birthplace | US_citizen | language | education | employment | health_insurance | personal_income | on_welfare | poverty_threshold | work_transport | puma_death_rate | puma_hosp_rate | puma_vacc_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3701 | 18 | 25 | Bronx | 0 | 166870 | No | No | 2 | 0 | Male | 58 | White | Non-US | Yes | English | Post-Graduate Degree | Employed | Private | 83435 | No | Above | Public Transit | 0.3982366 | 1.064625 | 55.79213 |
| 3701 | 22 | 22 | Bronx | 1018 | 123192 | No | Yes | 2 | 0 | Male | 52 | Black | Non-US | Yes | Other | Some College | Employed | Private | 107920 | No | Above | Private Vehicle | 0.3982366 | 1.064625 | 55.79213 |
| 3701 | 16 | 17 | Bronx | 2000 | 125000 | No | Yes | 1 | 0 | Female | 59 | White | US | Unknown | Spanish | Some College | Employed | Private | 125000 | No | Above | Public Transit | 0.3982366 | 1.064625 | 55.79213 |
| 3701 | 4 | 6 | Bronx | 0 | 140588 | Yes | Yes | 6 | 3 | Male | 64 | Asian and Pacific Islander | Non-US | Yes | Chinese | Post-Graduate Degree | Not in labor force | Private | 11264 | No | Above | Other | 0.3982366 | 1.064625 | 55.79213 |
| 3701 | 20 | 19 | Bronx | 1058 | 52876 | No | Yes | 2 | 1 | Male | 44 | White | Non-US | Yes | Spanish | Some College | Employed | Private | 32373 | No | Above | Public Transit | 0.3982366 | 1.064625 | 55.79213 |
| 3701 | 9 | 9 | Bronx | 0 | 26101 | No | Yes | 1 | 0 | Female | 64 | White | US | Unknown | English | Post-Graduate Degree | Employed | Private | 26101 | No | Above | Private Vehicle | 0.3982366 | 1.064625 | 55.79213 |
Because our outcomes data was minimially aggregated at the PUMA level and could not be disaggregated further, we decided it would be prudent to create a second cleaned data set for which each observation corresponded to a given PUMA, rather than a given census observation. To do so, we needed to use perwt (person weight, describing the number of persons in the coded area for whom the interview was representative) and hhwt (household weight, which plays a similar role but by household rather than by person in each geographic area) to weight our key census variables at the aggregated PUMA level.
# Example data frame with weightings for summary stats over each PUMA
nyc_puma_summary = cleaned_data %>%
# Note: do we need to filter to one individual per household for household weightings?
group_by(puma) %>%
summarize(
total_people = sum(perwt),
median_household_income = weighted.median(household_income, hhwt, na.rm = TRUE),
perc_foodstamps = sum(hhwt[on_foodstamps == "Yes"]) * 100 / sum(hhwt),
perc_broadband = sum(hhwt[has_broadband == "Yes"]) * 100 / sum(hhwt),
perc_male = sum(perwt[sex == "Male"]) * 100 / sum(perwt),
median_age = weighted.median(age, perwt, na.rm = TRUE),
perc_white = sum(perwt[race == "White"]) * 100 / sum(perwt),
perc_foreign_born = sum(perwt[birthplace == "Non-US"]) * 100 / sum(perwt),
perc_citizen = sum(perwt[US_citizen == "Yes"]) * 100 / sum(perwt),
perc_english = sum(perwt[language == "English"]) * 100 / sum(perwt),
perc_college = sum(perwt[education %in% c("Some College", "Bachelor's Degree", "Post-Graduate Degree")]) * 100 / sum(perwt),
perc_unemployed = sum(perwt[employment == "Unemployed"]) * 100 / sum(perwt),
perc_insured = sum(perwt[health_insurance %in% c("Private", "Public")]) * 100 / sum(perwt),
median_personal_income = weighted.median(personal_income, perwt, na.rm = TRUE),
perc_welfare = sum(perwt[on_welfare == "Yes"]) * 100 / sum(perwt),
perc_poverty = sum(perwt[poverty_threshold == "Below"]) * 100 / sum(perwt),
perc_public_transit = sum(perwt[work_transport == "Public Transit"]) * 100 / sum(perwt),
covid_hosp_rate = median(puma_hosp_rate),
covid_vax_rate = median(puma_vacc_rate),
covid_death_rate = median(puma_death_rate)
)
# View first few rows of PUMA-summarized data frame
nyc_puma_summary %>%
head() %>%
knitr::kable()
| puma | total_people | median_household_income | perc_foodstamps | perc_broadband | perc_male | median_age | perc_white | perc_foreign_born | perc_citizen | perc_english | perc_college | perc_unemployed | perc_insured | median_personal_income | perc_welfare | perc_poverty | perc_public_transit | covid_hosp_rate | covid_vax_rate | covid_death_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3701 | 110016 | 69000 | 24.49306 | 89.50725 | 46.88500 | 39 | 46.53050 | 34.50407 | 20.26887 | 41.74120 | 47.96393 | 3.449498 | 93.91725 | 22357.09 | 20.21706 | 22.66943 | 24.79821 | 1.0646245 | 55.79213 | 0.3982366 |
| 3702 | 151067 | 68610 | 28.24215 | 82.93904 | 45.12435 | 36 | 14.06925 | 42.97100 | 27.86909 | 66.47183 | 39.83729 | 4.534412 | 92.58276 | 20859.00 | 22.08623 | 18.60036 | 21.55732 | 1.1782180 | 47.19203 | 0.2688261 |
| 3703 | 119529 | 77588 | 15.90074 | 86.60915 | 46.66734 | 43 | 41.20088 | 22.69324 | 16.91138 | 58.42432 | 45.13131 | 3.577374 | 94.11273 | 25030.00 | 16.73067 | 16.28140 | 19.85292 | 1.3040106 | 58.33806 | 0.4052719 |
| 3704 | 126664 | 64662 | 23.88688 | 81.56041 | 47.66311 | 37 | 32.72911 | 37.88685 | 23.03812 | 41.10165 | 40.30664 | 4.141666 | 92.33879 | 20362.00 | 20.17700 | 18.97698 | 23.55681 | 0.6861973 | 29.38498 | 0.2092428 |
| 3705 | 171016 | 36500 | 50.66455 | 87.59563 | 46.19509 | 30 | 15.45177 | 33.49043 | 16.46103 | 33.31033 | 28.35290 | 5.289563 | 91.41484 | 11264.00 | 29.20428 | 39.26943 | 23.03995 | 0.8255382 | 33.17926 | 0.2018467 |
| 3706 | 131986 | 43490 | 46.18132 | 82.15610 | 48.26724 | 32 | 19.37099 | 46.80496 | 20.58855 | 20.86812 | 29.15536 | 5.606655 | 89.18219 | 15000.00 | 25.09812 | 32.16402 | 28.97883 | 0.7723878 | 32.95661 | 0.1812561 |
This cleaned and aggregated data set has 55 rows, one for each PUMA, and is the basis for much of the exploratory analysis that follows.
Ultimately, we ended the data cleaning process with two data sets at different levels of aggregation: one at the census interview level, and another at the PUMA level. Most of our exploratory analysis focused on evaluating predictors and outcomes by PUMA, although the interview-level data was still helpful for us to determine outcome disparities across predictors city-wide.
We began by seeking a better understanding of how each of our three key outcomes – hospitalization rate, death rate, and vaccination rate – differ by PUMA.
When PUMAs are ranked from lowest outcome rate to highest outcome rate, and colored by borough, we find the following, as visualized in the graphs below:
Although our regression modeling explores the relationship between particular variable pairings more fully to check for collinearity, we also were interested in observing how associated our key outcome variables were.
We found that hospitalization and death rate were significantly correlated, whereas vaccination had limited correlation with both hospitalization (0.187) and death (0.075) at the PUMA level. Interestingly, the correlation between our key outcome variables was effectively modified by borough; for example, vaccination and hospitalization were significantly correlated in the Bronx (0.930), whereas hospitalization and death were not significantly correlated in Staten Island (0.526).
After exploring outcomes at the PUMA level, we were keen to dive more deeply into disparities by borough. Beyond simply confirming with boxplots the distribution of key outcomes across PUMAs in a given borough (e.g. hospitalization and death distributed towards higher rates in Queens and Bronx PUMAs, vaccination distributed towards higher rates in Queens and Manhattan), we were also curious to see what would happen if we took the median city-wide PUMA for each outcome, and binarily divided the PUMAs in each borough into those above or below that median. Although we’re working with a small sample of only 55 PUMAs, which makes our borough percentages highly sensitive to single-PUMA changes, we find – to take one example – that ~79% of PUMAs in Queens are above the city-wide median PUMA on death rate, but ~86% of PUMAs in Queens are above the city-wide median PUMA on vaccination rate.
For each unique combination of age group, race, and sex, we wanted to determine which demographic category performed best and worst on each key outcome, and populated the following tables:
# Lowest hospitalization rates
race_age_sex %>%
filter(outcome == "hosp_rate") %>%
mutate(
outcome_rate = outcome_rate
) %>%
arrange(outcome_rate) %>%
select(race, age_class, sex, outcome_rate) %>%
head() %>%
knitr::kable(
caption = "Lowest hospitalization rates",
col.names = c("Race", "Age", "Sex", "% Hospitalized")
)
| Race | Age | Sex | % Hospitalized |
|---|---|---|---|
| White | 21-30 | Female | 0.8761509 |
| White | 31-40 | Male | 0.8859089 |
| White | 31-40 | Female | 0.8908692 |
| White | 21-30 | Male | 0.8963492 |
| White | 41-50 | Male | 0.9275371 |
| White | 41-50 | Female | 0.9347574 |
# Highest hospitalization rates
race_age_sex %>%
filter(outcome == "hosp_rate") %>%
mutate(
outcome_rate = outcome_rate
) %>%
arrange(desc(outcome_rate)) %>%
select(race, age_class, sex, outcome_rate) %>%
head() %>%
knitr::kable(
caption = "Highest hospitalization rates",
col.names = c("Race", "Age", "Sex", "% Hospitalized")
)
| Race | Age | Sex | % Hospitalized |
|---|---|---|---|
| American Indian | 81-90 | Male | 1.272494 |
| Other | 61-70 | Male | 1.216953 |
| Other | 11-20 | Male | 1.211757 |
| Other | 41-50 | Male | 1.200270 |
| Other | 61-70 | Female | 1.185197 |
| Asian and Pacific Islander | 91-100 | Male | 1.173874 |
# Lowest death rates
race_age_sex %>%
filter(outcome == "death_rate") %>%
mutate(
outcome_rate = outcome_rate
) %>%
arrange(outcome_rate) %>%
select(race, age_class, sex, outcome_rate) %>%
head() %>%
knitr::kable(
caption = "Lowest death rates",
col.names = c("Race", "Age", "Sex", "% Deceased")
)
| Race | Age | Sex | % Deceased |
|---|---|---|---|
| White | 21-30 | Female | 0.2376629 |
| White | 31-40 | Male | 0.2424339 |
| White | 21-30 | Male | 0.2435737 |
| White | 31-40 | Female | 0.2455473 |
| Other | 81-90 | Male | 0.2541018 |
| White | 41-50 | Male | 0.2564677 |
# Highest death rates
race_age_sex %>%
filter(outcome == "death_rate") %>%
mutate(
outcome_rate = outcome_rate
) %>%
arrange(desc(outcome_rate)) %>%
select(race, age_class, sex, outcome_rate) %>%
head() %>%
knitr::kable(
caption = "Highest death rates",
col.names = c("Race", "Age", "Sex", "% Deceased")
)
| Race | Age | Sex | % Deceased |
|---|---|---|---|
| 2+ races | 91-100 | Male | 0.3525028 |
| Asian and Pacific Islander | 91-100 | Male | 0.3386789 |
| American Indian | 81-90 | Male | 0.3246158 |
| Other | 11-20 | Male | 0.3228600 |
| Other | 41-50 | Male | 0.3220782 |
| Other | 71-80 | Male | 0.3202419 |
# Lowest vax rates
race_age_sex %>%
filter(outcome == "vax_rate") %>%
mutate(
outcome_rate = outcome_rate
) %>%
arrange(outcome_rate) %>%
select(race, age_class, sex, outcome_rate) %>%
head() %>%
knitr::kable(
caption = "Lowest vaccination rates",
col.names = c("Race", "Age", "Sex", "% Vaccinated")
)
| Race | Age | Sex | % Vaccinated |
|---|---|---|---|
| American Indian | 71-80 | Male | 49.32433 |
| Black | 11-20 | Female | 49.35445 |
| Black | <11 | Male | 49.47578 |
| Black | 11-20 | Male | 49.48017 |
| Black | 31-40 | Female | 49.49496 |
| Black | <11 | Female | 49.51926 |
# Highest vax rates
race_age_sex %>%
filter(outcome == "vax_rate") %>%
mutate(
outcome_rate = outcome_rate
) %>%
arrange(desc(outcome_rate)) %>%
select(race, age_class, sex, outcome_rate) %>%
head() %>%
knitr::kable(
caption = "Highest vaccination rates",
col.names = c("Race", "Age", "Sex", "% Vaccinated")
)
| Race | Age | Sex | % Vaccinated |
|---|---|---|---|
| Asian and Pacific Islander | 91-100 | Female | 66.13596 |
| Asian and Pacific Islander | 71-80 | Female | 65.89982 |
| Asian and Pacific Islander | 71-80 | Male | 65.73426 |
| Asian and Pacific Islander | 31-40 | Male | 65.61553 |
| Asian and Pacific Islander | 31-40 | Female | 65.57662 |
| 2+ races | 81-90 | Male | 65.47910 |
Overall, hospitalization and death rates were lowest (best) among white males and females under age 30, whereas vaccination rates were highest (best) among Asian and Pacific Islander males and females. Older American Indian individuals, along with younger and middle-aged Black individuals, tended to have the lowest vaccination rates, while mixed race, American Indian, and “other” racial groups tended to have higher hospitalization and death rates.
After exploring disparities in key outcomes across PUMAs and boroughs, we turn towards our large set of predictors, and begin by trying to determine which predictors to focus on using a correlation matrix (with outcomes). In the plot below, we include text only on those tiles that are highly significant, with p < 0.01.
Interestingly, we find:
All in all, there are a couple of interesting trends observed in this correlation matrix. First, income seems strongly associated with all three outcome variables. In addition, the variables strongly associated with hospitalization rate also tend to be strongly associated with death rate, which is not much of a surprise given the high correlation already observed between hospitalization and death rate across PUMAs. Finally, we note that many signifiers of socioeconomic status – including poverty level, welfare, unemployment, and food stamp utilization – seem highly correlated with vaccination, but not with hospitalization and death. Given that vaccination is a more “active” outcome (requiring deliberate action) here than hospitalization or death, which are both the result of (in all likelihood) “passive” or indeliberate transmission, the association of socioeconomic factors with vaccination begin to indicate the potential impact from significant structural inequalities at play, hindering healthcare access among the poor.
We then selected each of the four variables with highest correlation (positive or negative) to each outcome, excluding obvious redundancies (like personal income and household income), and explored specific association trends between predictor and outcome, colored by borough.
Beyond the predictors significantly associated with each outcome, we wanted to focus as well on how outcomes varied by levels of key socioeconomic variables – namely, race, age group, and sex. Because we lack individual outcome data (i.e. each census observation within a given PUMA has the same PUMA-level hospitalization, death, and vaccination rate), we assumed for this analysis that all persons in a given PUMA had equal likelihood of a particular outcome (hospitalization, death, or vaccination) being true, with the likelihood corresponding to the PUMA outcome rate.
Some key findings include:
Similarly, we examined how key outcomes varied across categories of a few seemingly important predictor variables for each outcome observed in our correlation matrix:
Again, there were a few interesting findings here, such as:
To build on the work above, we wanted to observe disparities within boroughs, knowing that different boroughs have different demographic compositions and that the ideal visualizations would allow us to understand how outcomes vary conditioned on borough demographic composition. Each of the following visualizations has three panels: number of people with outcome, categorized by borough and colored by predictor level; % of people with outcome in each borough, colored by predictor level, compared to the overall composition of the borough by predictor level; and percent with outcome variable, in each borough, plotted by level of predictor.
The most interesting finding here is that Manhattan seems to have the most significant racial disparities on hospitalization rate and death rate, along with age disparities on hospitalization rate, and racial and age disparities in vaccination rate. In general, the level of inequality on key outcomes across demographics in Manhattan tends to be higher than in other boroughs.
outcome_by_year <-
read_csv("./modeling/outcome_puma_by_year.csv") %>%
rename(puma = puma10)
unbiased_data <- read_csv("./modeling/unbiased_group_means.csv") %>%
merge(outcome_by_year, by = "puma")
best_model <- lm(puma_hosp_rate_2020 ~ language_spanish +
education_bachelors_degree +
birthplace_us + health_insurance_public + language_english +
personal_income + employment_not_in_labor_force +
health_insurance_public:personal_income +
education_bachelors_degree:personal_income +
language_english:birthplace_us, data = unbiased_data)
full_model <- lm(puma_hosp_rate_2020 ~
(language_spanish +
education_bachelors_degree +
birthplace_us + health_insurance_public + language_english +
personal_income + employment_not_in_labor_force)^2,
data = unbiased_data)
Let \(S\) denote the set of PUMAs in NYC. We consider the latent variable model \[ y_s = \xi_s' \beta + \varepsilon_s \] for each \(s \in S\), where \(\xi_s\) a vector of PUMA-level means of data from the census. There are two challenges in considering such a model: (1) we know historically that NYC neighborhoods, particularly those within boroughs, are subject to spatial correlation; and (2) because the census data is recorded at the interview level, we do not observe the group-level means \(\xi_s\).
In order to address (1), we employ heteroscedasticity-robust White standard errors to deal with the potential threat of spatial correlation. We find that all of our coefficient estimates retain their significance under the adjusted standard errors, and hence we leave this check on robustness to the appendix.
More interesting is the challenge presented in (2). For each \(s \in S\), suppose we observe \(i \in \{ 1, \ldots, n_s \}\) individual-level interviews. It is natural to consider the group mean \[ \bar{X}_s \equiv \frac{1}{n_s} \sum_{i = 1}^{n_s} X_{is}. \] However, as shown in Croon and Veldhoven (2007), it is generally the case that using the observed group mean \(\bar{X}_s\) as an estimator of \(\xi_s\) will lead to biased regression coefficients. Thus, we address (2) by following the procedure outlined by Croon and Veldhoven to re-weight our group means and obtain adjusted group means \(\tilde{X}_s\). In particular, let $ _{}$ and \(\hat{\Sigma}_{\nu \nu}\) denote the usual ANOVA estimates of the between- and within-group variation matrices, respectively. The adjusted group means are then given by \[ \tilde{X}_s \equiv \bar{X}' (I - W_s) + \bar{X}_s' W_s, \quad \text{where} \] \[ W_s \equiv \left( \hat{\Sigma}_{\xi \xi} + \hat{\Sigma}_{\nu \nu} / n_s \right)^{-1} \hat{\Sigma}_{\xi \xi} \] and \(I\) denotes the identity matrix.
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
best_model_tidy <- fit(lm_spec, puma_hosp_rate_2020 ~ language_spanish +
education_bachelors_degree +
birthplace_us + health_insurance_public + language_english +
personal_income + employment_not_in_labor_force +
health_insurance_public:personal_income +
education_bachelors_degree:personal_income +
language_english:birthplace_us, data = unbiased_data)
check_model(best_model_tidy,
check = c("linearity", "qq", "normality", "outliers", "homogeneity"))
Linearity: It is to be observed that the relationship between independent and dependent variables is fairly linear in our model. Thus, we claim that the mean of the hospitalization rate in 2020 (outcome) is a linear combination of the model parameters and the predictor variables.
Equal Variance (homoscedasticity): Based on the residual vs fitted plot, it is to be observed that the points in the plot are randomly dispersed around the horizontal line at 0 without showing a certain trend. This implies no violation of homoscedasticity.
Normality of residuals: According to the normality probability plot, it is to be observed most of the points fall along the line. Therefore, we claim that residuals are normally distributed in our model.
Influential points among outliers: According to residuals vs leverage plot, it is to be observed all points are inside the contour lines. Hence, there are no influential observations. Therefore, we claim that the residuals are independent of each other.
Residuals are uncorrelated: On the basis of the autocorrelation plot, it is to be observed that the lags do not have significant effect because they are within the bounds.
summary(best_model) %>%
broom::glance() %>%
bind_rows(summary(full_model) %>% broom::glance()) %>%
mutate(model = c("Best Model", "Full Model")) %>%
relocate(model)
## # A tibble: 2 × 9
## model r.squared adj.r.squared sigma statistic p.value df df.residual nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 Best … 0.669 0.594 140. 8.89 8.90e-8 10 44 55
## 2 Full … 0.798 0.580 142. 3.66 6.78e-4 28 26 55
Cp <- ols_mallows_cp(best_model, full_model)
names(Cp) <- c("Mallows' Cp")
Cp
## Mallows' Cp
## 9.525754
Adjusted R-squared: We compared our model to the full model which includes all main effects of our model with all possible second-order interaction terms of main effects. It is obvious that the full model has higher R-squared (0.797) than that of our model (0.669). However, our model has a higher adjusted R-squared (0.593) than that of the full model (0.579).
Mallows’ Cp: Based on Mallows’ Cp criterion, we want our model to have Cp value less than or equal to the number of parameters, indicating little or no bias in the regression model. we obtained our Cp) from our model and the full model. The Cp value is 9.5 and it is less than the number of our model parameters (11). Therefore, we claim that our regression model has little or no bias.
TODO: Output
summary(best_model) %>%
broom::tidy()
## # A tibble: 11 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 220. 6.18e+2 0.356 7.23e-1
## 2 language_spanish 1446. 2.83e+2 5.12 6.50e-6
## 3 education_bachelors_degree -1386. 1.20e+3 -1.15 2.56e-1
## 4 birthplace_us -2324. 5.51e+2 -4.22 1.20e-4
## 5 health_insurance_public -3614. 7.60e+2 -4.76 2.15e-5
## 6 language_english -365. 5.39e+2 -0.678 5.01e-1
## 7 personal_income -0.0266 8.04e-3 -3.31 1.86e-3
## 8 employment_not_in_labor_force 4345. 9.56e+2 4.54 4.29e-5
## 9 health_insurance_public:personal_in… 0.0812 1.92e-2 4.22 1.19e-4
## 10 education_bachelors_degree:personal… 0.0554 1.87e-2 2.97 4.81e-3
## 11 birthplace_us:language_english 2145. 9.08e+2 2.36 2.26e-2
Following our regression work, we decided to make our insights more actionable by developing a novel scoring method capable of indicating whether a PUMA is at relatively lower or higher risk of achieving some COVID-19-related outcome. To demonstrate the method, we developed a risk score for whether a PUMA is likely to have achieved or not achieved a vaccination rate of 70% among residents, corresponding to the level of population immunity generally considered requisite to “halt the pandemic.” That said, our method is applicable to other outcomes as well, and one might imagine using it to score a given PUMA on the possibility of a hospitalization rate above X%, or a death rate above Y%. We only chose to begin with vaccination rate given our inability to develop a linear regression for this particular outcome, as opposed to hospitalization and death rates.
The risk scoring method we developed is predicated on (regularized, lasso) logistic regression, used most often for classification tasks because predictions can be interpreted as class probabilities. Regularization further prevents over-fitting of the model. After defining our outcome as 1 for below 70% vaccination and 0 for equal to or above 70% vaccination, we converted our set of predictors to matrix form, and then trained a glmnet model on our training data using 5-fold cross-validation repeated 100 times given the large number of predictors compared to our mere 55 PUMA samples, and because we were interested in predicting the risk score of vaccination rate for each PUMA in our data set. Through our lasso regression, we found an optimal lambda tuning parameter of 0.0102, and generated a model prediction accuracy of 0.886 (~87%). Generally, however, obtaining the kappa value averaged over the simulated confusion matrices is a more useful metric for prediction given unbalanced classes (of our 55 PUMAs, 41 achieve vaccination rate >= 70%, and only 14 do not); Our optimal model’s averaged kappa was 0.63, which is considered reasonably decent, but again suggests that the limited number of in-sample data points may result in model over-fitting.
# 1 indicates BELOW 70% vaccination rate
logistic_df = nyc_puma_summary %>%
mutate(
below_herd_vax = as.factor(ifelse(covid_vax_rate >= 70, 0, 1))
) %>%
select(-puma, -total_people, -covid_hosp_rate, -covid_death_rate, -covid_vax_rate)
# Define predictors matrix
x = model.matrix(below_herd_vax ~ ., logistic_df)[,-1]
# Define outcomes
y = logistic_df$below_herd_vax
library(caret) # Note: new libraries are loaded here due to potential function masking from the `caret` library
library(mlbench)
set.seed(777)
vax_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 100,
savePredictions = T
)
# Goal is to find optimal lambda
lasso_model <- train(below_herd_vax ~ ., data = logistic_df,
method = "glmnet",
trControl = vax_cv,
tuneGrid = expand.grid(
.alpha = 1,
.lambda = seq(0.0001, 1, length = 100)),
family = "binomial")
coef <- coef(lasso_model$finalModel, lasso_model$bestTune$lambda)
sub_lasso <-
subset(lasso_model$pred, lasso_model$pred$lambda == lasso_model$bestTune$lambda)
caret::confusionMatrix(table(sub_lasso$pred, sub_lasso$obs))
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 731 258
## 1 369 4142
##
## Accuracy : 0.886
## 95% CI : (0.8773, 0.8943)
## No Information Rate : 0.8
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6297
##
## Mcnemar's Test P-Value : 1.118e-05
##
## Sensitivity : 0.6645
## Specificity : 0.9414
## Pos Pred Value : 0.7391
## Neg Pred Value : 0.9182
## Prevalence : 0.2000
## Detection Rate : 0.1329
## Detection Prevalence : 0.1798
## Balanced Accuracy : 0.8030
##
## 'Positive' Class : 0
##
Notably, feature selection from our lasso regression discovered that the most important predictors of risk for sub-70% vaccination were insurance composition, employment composition, poverty composition, and welfare composition in a given PUMA – largely aligned with key correlates of vaccination noted in our exploratory analysis.
lambda <- seq(0.0001, 1, length = 100)
lambda_opt = lasso_model$bestTune$lambda
result_plot <- broom::tidy(lasso_model$finalModel) %>%
select(term, lambda, estimate) %>%
complete(term, lambda, fill = list(estimate = 0) ) %>%
filter(term != "(Intercept)") %>%
ggplot(aes(x = log(lambda, 10), y = estimate, group = term, color = term)) +
geom_path() +
geom_vline(xintercept = log(lambda_opt, 10), color = "blue", size = 1.2) +
theme(legend.position = "none") +
labs(y = "Coefficient Estimate", title = "Coefficient Estimates for Varying Values of Lambda")
result_plot
Again, we wanted to move beyond binary classification to develop a risk score. Generally, in logistic regression, when a data point is predicted with probability > 0.5 to be a “1,” the model classifies it as a 1, and otherwise as a 0. We obtained risk scores not only by classifying PUMAs as above or below 70% vaccination rate using our prediction model, but by obtaining the exact probability of a given data point being a 1. For instance, if a PUMA has an 85% chance of being below 70% vaccination rate according to our classifier, its risk score would be 85, even though our model would binarily predict it to be a “1” rather than a “0.”
lambda <- lasso_model$bestTune$lambda
lasso_fit = glmnet(x, y, lambda = lambda, family = "binomial")
risk_predictions = (round((predict(lasso_fit, x, type = "response"))*100, 1))
puma <- nyc_puma_summary %>%
select(puma)
vax <- logistic_df %>%
select(below_herd_vax)
risk_prediction <-
bind_cols(puma, vax, as.vector(risk_predictions)) %>%
rename(risk_prediciton = ...3)
risk_prediction %>%
mutate(puma = fct_reorder(puma, risk_prediciton, .desc = TRUE)) %>%
ggplot(aes(x = puma, y = risk_prediciton, fill = below_herd_vax)) +
geom_bar(stat = "identity") +
geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
labs(title = "Predicting Risk Score of Vaccination Rate across PUMA",
x = "PUMA", y = "Predicted Risk Score", fill = "Actual Vax Status") +
scale_fill_viridis(discrete = TRUE, labels = c("Below 70%", "Above 70%"))
We were also curious how statistical learning techniques would cluster our PUMAs based on predictors, and how those clusters might correspond to performance on key outcomes. We separated our data into a tibble of predictors only and a tibble of outcomes only, then fit three clusters – just to try it – on our predictors alone. We proceeded to plot each PUMA on hospitalization rate vs vaccination rate (choosing to forego death rate, in this case, given its collinearity and likely redundancy of hospitalization rate), then colored each data point by predicted clustering.
# Define tibble of predictors only
predictors = nyc_puma_summary %>%
select(-puma, -total_people, -covid_hosp_rate, -covid_vax_rate, -covid_death_rate)
# Define tibble of outcomes only
outcomes = nyc_puma_summary %>%
select(covid_hosp_rate, covid_death_rate, covid_vax_rate)
# Define tibble of pumas only
pumas = nyc_puma_summary %>%
select(puma)
# Fit 3 clusters on predictors
kmeans_fit =
kmeans(x = predictors, centers = 3)
# Add clusters to data frame of predictors and bind with PUMA and outcomes data
predictors =
broom::augment(kmeans_fit, predictors)
# Bind columns
full_df = cbind(pumas, outcomes, predictors)
# Summary df
summary_df = full_df %>%
group_by(.cluster) %>%
summarize(
median_hosp = median(covid_hosp_rate),
median_death = median(covid_death_rate),
median_vax = median(covid_vax_rate)
)
# Plot predictor clusters against outcomes
# Example: try hospitalization vs vaccination
ggplot(data = full_df, aes(x = covid_hosp_rate, y = covid_vax_rate, color = .cluster)) +
geom_point() +
geom_point(data = summary_df, aes(x = median_hosp, y = median_vax), color = "black", size = 4) +
geom_point(data = summary_df, aes(x = median_hosp, y = median_vax, color = .cluster), size = 2.75) +
labs(x = "COVID Hospitalization Rate", y = "COVID Vaccination Rate", title = "COVID Hospitalization vs Vaccination Rates for each PUMA", color = "Cluster")
Generally, our three clusters included:
For fun, we also evaluated Euclidean distance between our observed PUMAs, and alternatively mapped PUMAs onto their respective clusters after reducing our predictors to two key (principal) component dimensions.
# Scale predictors
for_clustering = predictors %>%
select(-.cluster) %>%
na.omit() %>%
scale()
# Evaluate Euclidean distances between observations
distance = get_dist(for_clustering)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
# Cluster with three centers
k_scaled = kmeans(for_clustering, centers = 3)
# Visualize cluster plot with reduction to two dimensions
fviz_cluster(k_scaled, data = for_clustering)
# Bind with outcomes and color clusters
full_df = for_clustering %>%
as_tibble() %>%
cbind(outcomes, pumas) %>%
mutate(
cluster = k_scaled$cluster
)
Because we settled on setting the number of clusters at three relatively arbitrarily, we decided to evaluate the clustering quality using both WSS, silhouette, and gap methods.
# Check where elbow occurs using WSS method
fviz_nbclust(for_clustering, kmeans, method = "wss")
# Check for optimal number of clusters using silhouette method
fviz_nbclust(for_clustering, kmeans, method = "silhouette")
# Check number of clusters that minimize gap statistic
gap_stat = clusGap(for_clustering, FUN = kmeans, nstart = 25, K.max = 20, B = 50)
fviz_gap_stat(gap_stat)
The elbow in our WSS plot indicates that three clusters may be optimal, whereas the silhouette plot shows an average silhouette width optimized at two clusters. Finally, the gap plot shows that clustering is actually optimized when the number of clusters equals 1 – i.e. when no clustering occurs. The relative lack of concordance between these assessments of clustering quality is no surprise given the fact that our model is fit on only 55 predictors. For completeness, we re-modeled our clustering of predictors with only k = 2 clusters, and again tried to visualize where these clusters of PUMAs were distributed on the hospitalization/vaccination rate axes:
# Scale predictors
for_clustering = predictors %>%
select(-.cluster) %>%
na.omit() %>%
scale()
# Evaluate Euclidean distances between observations
distance = get_dist(for_clustering)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
# Cluster with two centers
k_scaled2 = kmeans(for_clustering, centers = 2)
# Visualize cluster plot with reduction to two dimensions
fviz_cluster(k_scaled2, data = for_clustering)
# Bind with outcomes and color clusters
full_df = for_clustering %>%
as_tibble() %>%
cbind(outcomes, pumas) %>%
mutate(
cluster = k_scaled2$cluster
)
With two clusters, our low hospitalization/high vaccination cluster remains, but our two mid-level hospitalization rate clusters are condensed into one cluster.
The primary obstacles at the outset of the project were related to data pre-processing, getting our disparate data sources to play nice with each other. As mentioned elsewhere, our outcome variables were obtained at the ZCTA-level geography while our predictor variables were obtained at the interview-level and coded to PUMA geographies. We were able to use a ZCTA-PUMA crosswalk data set from Baruch College and a bit of math to merge these data sets for further analysis.
Another obstacle was to aggregate the interview-level data to the PUMA-level geography and to tidy the IPUMS variables according to the census data dictionary. For instance, factor variables like education and race were coded numerically, but these data needed to have descriptive factor levels for analysis. This process is described in more detail in the Data Sources and Cleaning section.
These data presented a unique regression challenge as well. (TO BE COMPLETED)
In our exploratory analysis we looked at COVID-19 hospitalizations, deaths, and vaccinations across boroughs, PUMAs, and demographic combinations.
When looking at outcomes across boroughs, we saw that PUMA geographies in Queens and the Bronx tended to be above the median NYC PUMA for COVID-19 hospitalization and death rates. Interestingly, PUMA geographies in Queens were among the most highly vaccinated areas in NYC as well.
In general, we did not see notable differences of COVID-19 outcomes between males and females. When looking at demographic categories, we found that hospitalization and death rates were lowest among white males and females under 30 and that vaccination rates were highest among Asian and Pacific Islander males and females. The lowest vaccination rates were among older American Indian individuals and younger and middle-aged Black individuals.
In an analysis of correlations between predictors and outcomes across PUMA geographies, we found income to be strongly associated with COVID-19 hospitalizations (negative association), deaths (negative association), and vaccinations (positive association). We found that many indicators of socioeconomic status – including poverty level, welfare, unemployment, and food stamp utilization – seem highly negatively correlated with vaccination, but not with hospitalization and death. These findings may indicate the potential impact from significant structural inequalities that hinder healthcare access among the poor, given that vaccination is a more “active” outcome (requiring deliberate action) here than hospitalization or death, which are both the result of (in all likelihood) “passive” or indeliberate transmission.
Lastly, we saw that among boroughs, Manhattan had the most significant disparities of outcomes across demographic groups like race and age.
(REGRESSION CONCLUSIONS TO BE ADDED)
Based on our analysis, future research could be performed to better understand the associations we found between socioeconomic status and barriers to healthcare access, shown in the correlations between economic predictors and hospitalization, mortality, and vaccination outcomes. These relationships and the potential causal mechanisms are important to understand in the efforts to emerge from the current COVID-19 pandemic and for future pandemics that we may face.
How can we ensure that diverse populations have access and take advantage of healthcare solutions? Future research can help understand “how to turn vaccines into vaccinations” to quote Dr. Michael Osterholm.